Part 4: Differential Expression and Pathway Enrichment

Differential expression and Functional enrichment

In this section, we will use the count data from Kallisto, redo some of the QC, and perform the differential expression analysis using DESeq2 (Love, Huber, and Anders 2014). This part is adapted from the official DESeq2 vignette, available on Bioconductor, so feel free to look there for additional information.

Preparing data

suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(pheatmap)
  library(RColorBrewer)
  library(stringr)
  library(plotly)
  library(tidyverse)
  library(tximport)
  library(DESeq2)
})
source("helpers.R") # You can get more functions defined in other R script

This time, we will use the transcript counts from Kallisto rather than the featureCounts from the STAR alignments. The point of this section is to see how similar the pseudoalignment results are to the alignment- dependent methods.

countsDirectoryToUse <- "~/work/gse154927-full/kallisto"
kallistoFilePaths <- list.files(path=countsDirectoryToUse, pattern="*.txt", full.names=T)

First, let’s create our meta-information frame and instantiate some variables we will use later.

# Define meta dataframe for later use
conditions <- c(rep("HCT116_EpCAMhigh", 4), 
                rep("HCT116_EpCAMlow", 4),
                rep("SW480_EpCAMhigh", 4),
                rep("SW480_EpCAMlow", 4))
meta <- data.frame(
  Condition=as.factor(conditions),
  row.names=paste0(conditions, "_", rep(1:4, 4))
)

# Define some general-use parameters for use later
sigThresh <- 10  # significance threshold for the counts
conditionColours <- scales::hue_pal()(length(unique(meta$Condition)))  # choose pretty colours
# associate the conditions with the pretty colours
names(conditionColours) <- unique(meta$Condition)
sampleColours <- conditionColours[meta$Condition]
On Condition factor levels

R will choose a reference level for factors based on alphabetical order. Then, once we do the differential expression with DESeq2, you will never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. There are two solutions: you can either explicitly tell results which comparison to make using the contrast argument (this will be shown later), or you can explicitly set the factors levels. In order to see the change of reference levels reflected in the results names, you need to either run DESeq or nbinomWaldTest/nbinomLRT after the re-leveling operation.

Now, load the by-transcript gtf file to aid us in the annotation of our Kallisto results.

# Load the annotation
transcriptsAnnotation <- "~/work/grch38-p13-gencode-release-42/annotation/genes_annotation_byTranscript.txt"
seqAnno <- getFeatureAnnotation(transcriptsAnnotation, dataFeatureType="transcript")
Loading required package: data.table

Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year
The following objects are masked from 'package:dplyr':

    between, first, last
The following object is masked from 'package:purrr':

    transpose
The following object is masked from 'package:SummarizedExperiment':

    shift
The following object is masked from 'package:GenomicRanges':

    shift
The following object is masked from 'package:IRanges':

    shift
The following objects are masked from 'package:S4Vectors':

    first, second
Loading required package: rtracklayer
seqAnno <- seqAnno[seqAnno$type %in% "protein_coding", ]

Next, we load the counts using tximport , a convenient function for loading in count data from a variety of sources (Salmon, Kallisto, featureCounts, etc.) which also performs the aggregation of transcript counts to the gene level (remember that Kallisto produced count files on the transcript level).

# Get the files
names(kallistoFilePaths) <- rownames(meta)
stopifnot(all(file.exists(kallistoFilePaths)))  # Ensure we have all the files
kallistoFilePaths
                                                 HCT116_EpCAMhigh_1 
"/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMhigh_1.txt" 
                                                 HCT116_EpCAMhigh_2 
"/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMhigh_2.txt" 
                                                 HCT116_EpCAMhigh_3 
"/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMhigh_3.txt" 
                                                 HCT116_EpCAMhigh_4 
"/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMhigh_4.txt" 
                                                  HCT116_EpCAMlow_1 
 "/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMlow_1.txt" 
                                                  HCT116_EpCAMlow_2 
 "/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMlow_2.txt" 
                                                  HCT116_EpCAMlow_3 
 "/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMlow_3.txt" 
                                                  HCT116_EpCAMlow_4 
 "/home/rstudio/work/gse154927-full/kallisto/HCT116_EpCAMlow_4.txt" 
                                                  SW480_EpCAMhigh_1 
 "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMhigh_1.txt" 
                                                  SW480_EpCAMhigh_2 
 "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMhigh_2.txt" 
                                                  SW480_EpCAMhigh_3 
 "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMhigh_3.txt" 
                                                  SW480_EpCAMhigh_4 
 "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMhigh_4.txt" 
                                                   SW480_EpCAMlow_1 
  "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMlow_1.txt" 
                                                   SW480_EpCAMlow_2 
  "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMlow_2.txt" 
                                                   SW480_EpCAMlow_3 
  "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMlow_3.txt" 
                                                   SW480_EpCAMlow_4 
  "/home/rstudio/work/gse154927-full/kallisto/SW480_EpCAMlow_4.txt" 
# Use tximport to load the counts
txiKallisto <- tximport(kallistoFilePaths, type = "kallisto", 
                        tx2gene = seqAnno, 
                        ignoreAfterBar = TRUE, 
                        txIdCol = "transcript_id", 
                        geneIdCol = "gene_id")
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
summarizing abundance
summarizing counts
summarizing length

Take a quick look at the txiKallisto object we generated and take a look at the row names. Contrast this to the row names in the raw files.

# Your code here

txiKallisto is a simple list with matrices, "abundance", "counts", and "length", where the transcript level information is summarized to the gene-level. Typically, abundance is provided by the quantification tools as TPM (transcripts-per-million), while the counts are estimated counts (possibly fractional), and the "length" matrix contains the effective gene lengths.

Finally, let’s construct a DESeqDataSet from the txi object and sample information in meta

dds <- DESeq2::DESeqDataSetFromTximport(txiKallisto,
                                        colData=meta,
                                        design=~Condition)
using counts and average transcript lengths from tximport
dds
class: DESeqDataSet 
dim: 21057 16 
metadata(1): version
assays(2): counts avgTxLength
rownames(21057): ENSG00000000003 ENSG00000000005 ... ENSG00000291239
  ENSG00000291257
rowData names(0):
colnames(16): HCT116_EpCAMhigh_1 HCT116_EpCAMhigh_2 ...
  SW480_EpCAMlow_3 SW480_EpCAMlow_4
colData names(1): Condition

Kallisto QC

Let’s first perform a QC step for these counts, to ensure they match up with the ones from featureCounts we attained in the last exercise.

sigThresh <- 10
isPresent <- counts(dds) > sigThresh
isPresentCond <- rowsum(t(isPresent * 1), group=meta$Condition)
isPresentCond <- t(sweep(isPresentCond, 1,
                         table(meta$Condition)[rownames(isPresentCond)], FUN="/")) >= 0.5
isValid <- rowMeans(isPresentCond) >= 0.5
vsdQC <- DESeq2::vst(dds[isValid,])
using 'avgTxLength' from assays(dds), correcting for library size
# Extract normalized counts
vsdSE <- SummarizedExperiment::assay(vsdQC)

PCA

# Run PCA
pcDat  <- prcomp(t(vsdSE), scale. = FALSE)

# Calculate explained variance
varExp <- (100*pcDat$sdev^2)/sum(pcDat$sdev^2)

# Store the explained variance of top 8 PCs
varExp_df <- data.frame(PC= paste0("PC",1:8),
                          varExp=varExp[1:8])

# Scree plot
varExp_df %>%
  ggplot(aes(x=PC,y=varExp, group=1)) +
  geom_point(colour="steelblue", size=4) +
  geom_col(fill="steelblue") +
  geom_line() + 
  theme_bw() + ylim(c(0,100))

plotPCA(vsdQC, intgroup=c("Condition"), ntop=nrow(vsdQC))
using ntop=14924 top features by variance

MDS

mds <- limma::plotMDS(vsdSE, plot=FALSE)
mdsOut <- mds$eigen.vectors[,1:3]
colnames(mdsOut) <- c("Leading logFC dim1", "Leading logFC dim2", 
                      "Leading logFC dim3")
toPlot <- cbind(meta %>% rownames_to_column("Sample"), mdsOut)
plot_ly(toPlot, x=~`Leading logFC dim1`, y=~`Leading logFC dim2`, z=~`Leading logFC dim3`, color=~Condition, colors="Set1", type='scatter3d', mode='markers+text', text=~Sample, textposition = "top right") %>%
  plotly::layout(title="Classical MDS", scene=list(xaxis=list(title = 'Leading logFC dim1'), yaxis = list(title = 'Leading logFC dim2'), zaxis = list(title = 'Leading logFC dim3')))

Unsupervised Hierarchical Clustering Plot

d <- as.dist(1-cor(vsdSE, use="complete.obs"))
hc <- hclust(d, method="ward.D2")
WGCNA::plotDendroAndColors(hc, sampleColours, autoColorHeight=TRUE, hang = -0.1)

Exercise #1

  1. Take a look back at the rendered html from the previous exercise. Do the results look similar?

  2. Plot the top 2’000 most variable genes in a heatmap.

    # Your code here

Performing the DE

The standard differential expression analysis steps are wrapped into a single function, DESeq. The estimation steps performed by this function are described below, in the manual page for ?DESeq .

# Calculate size factors
dds <- estimateSizeFactors(dds, controlGenes = isValid)
using 'avgTxLength' from assays(dds), correcting for library size
sf <- 1 / dds@colData$sizeFactor

# Calculate DESeq
contrastName=c("Condition", "HCT116_EpCAMhigh", "HCT116_EpCAMlow")
dds <- DESeq(dds)
using pre-existing normalization factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
res <- results(dds, contrast=contrastName)
res <- correctDESeq2Output(res, isValid)

You may wonder why we designated all genes above threshold as being “control genes” when calculating the size factors. Though this list of course also includes genes which will presumably be differentially expressed between the two conditions, the majority of genes will not be, and this therefore serves as a good approximation of the control genes in cases where we do not have this information a priori.

Let’s set some thresholds that we will use going forward to filter the DE result. Namely, we will set the p-value, log2-fold change, and adjusted p-value thresholds above/below which we will consider genes to be differentially expressed.

pvalThresh <- 0.01
padjThresh <- 0.05
log2FCThresh <- 0.5

Next, let’s take a closer look at the results.

Exercise #2

  1. What are the columns of the object?
  2. What was the method used for p-value correction?
  3. How many significantly-expressed genes are there assuming we consider genes differentially expressed if p <= pvalThresh and log2FoldChange >= log2FCThresh?

Exploring and Visualizing Results

Let’s first summarise the results table we get.

summary(res)

out of 19532 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 1600, 8.2%
LFC < 0 (down)     : 1814, 9.3%
outliers [1]       : 0, 0%
low counts [2]     : 6050, 31%
(mean count < 9)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

You will notice that there are no low-count genes in the result since we filtered them all out prior to performing the DE.

The results function contains a number of arguments to customize the results table which is generated. You can read about these arguments by looking up ?results. Note that the results function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p-value below a given FDR cutoff, alpha. All genes which do not pass the threshold are set to NA.

res05 <- results(dds, contrast=contrastName, alpha=0.05)
summary(res05)

out of 19532 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 1233, 6.3%
LFC < 0 (down)     : 1378, 7.1%
outliers [1]       : 2, 0.01%
low counts [2]     : 4903, 25%
(mean count < 9)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

DESeq2 includes a number of convenience methods for visualizing results. The function plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet. Points will be colored red if the adjusted p value is less than 0.1 (the default alpha value).

plotMA(res, ylim=c(-2,2))

It is more useful visualize the MA-plot for the shrunken log2 fold changes, which removes the noise associated with log2 fold changes from low count genes without requiring arbitrary filtering thresholds.

resLFC <- correctDESeq2Output(lfcShrink(dds, contrast=contrastName, type="normal"), isValid)
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895
plotMA(resLFC, ylim=c(-2,2))

Volcano Plot

Let’s next visualise the data in a volcano plot. This type of plot summarises the p-values against log-fold changes as a scatterplot, resulting in a figure resembling a volcano. We will use the packaged EnhancedVolcano (Blighe, Rana, and Lewis 2022) since it works well out of the box.

EnhancedVolcano::EnhancedVolcano(resLFC,
                                 lab=rownames(res),
                                 x = 'log2FoldChange', y = 'pvalue')

Take a look at the output. Each spot in the scatter plot represents a gene with its corresponding position on the plot determined by the p-value and log-fold change. It additionally plots threshold lines and colors the genes based on which areas of the volcano plot the genes fall. Everything in the red areas passes both the log-fold and p-value thresholds. It additionally plots the names of some of the top genes.

However, due to the large effect size for some genes, it becomes difficult to see genes closer to the threshold lines. In addition, we would ultimately like to see gene names instead of gene IDs. To this end, let’s set a maximum x- and y-axis value for our plot and swap the gene IDs out for the gene names.

ymax <- 50
xmax <- 4
seqAnnoMinimal <- seqAnno %>%
  select(gene_id, gene_name) %>%
  group_by(gene_id, gene_name) %>%
  filter(row_number() == 1) %>%
  as.data.frame()
rownames(seqAnnoMinimal) <- seqAnnoMinimal$gene_id
seqAnnoMinimal <- seqAnnoMinimal[rownames(resLFC),]

# Set up the object to plot
toPlot <- res %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  mutate(pvalue=pmax(pvalue, 1/(10^ymax)),
         log2FoldChange=ifelse(log2FoldChange < 0,
                               pmax(log2FoldChange, -xmax),
                               pmin(log2FoldChange, xmax))) %>%
  merge.data.frame(., seqAnnoMinimal, by='gene_id', all = TRUE)

# Plot the EnhancedVolcano
EnhancedVolcano::EnhancedVolcano(toPlot,
  lab = toPlot$gene_name,
  selectLab = head(toPlot$gene_name),  # plot the first genes in our list
  drawConnectors = TRUE,
  arrowheads = FALSE,
  xlim = c(-xmax, xmax),
  ylim = c(0, ymax),
  boxedLabels = TRUE,
  parseLabels = TRUE,
  x = 'log2FoldChange',
  y = 'pvalue')

Exercise #3

  1. The authors write in their paper the following about the predicted mechanism in these samples: It has been established that in breast and pancreatic cancer ZEB1-driven EMT downregulates the expression of the RBP and splicing regulator ESRP1 as part of a self-enforcing feedback loop. Accordingly, among the top differentially expressed genes (DEGs) between EpCAMhi and EpCAMlo in SW480 and HCT116 colon cancer cells, ESRP1 was found to be upregulated both at the RNA and at the protein level in the quasi-mesenchymal subpopulation where ZEB1 expression is downregulated.

Can we confirm in our data that this finding holds water?

Clustering of Significant Features

# Calculate vsd count matrix again
vsd <- assay(DESeq2::vst(dds))[,1:8]
# First, we center the matrix
vsdCentered <- sweep(vsd, 1, rowMeans(vsd))

# Identify DE genes
topGenesIndex <- abs(res$log2FoldChange) >= log2FCThresh & 
  res$pvalue <= pvalThresh
topGenesIndex[!isValid] <- FALSE
topGenes <- rownames(res)[topGenesIndex]

# subset counts
countsToPlot <- vsdCentered[topGenes,]

# Clustering of high variance features
hmObj <- pheatmap(countsToPlot, 
         clustering_method="ward.D2",
         scale = "row", cluster_rows = TRUE,
         show_rownames = FALSE,
         cutree_rows = 6, cutree_cols = 2,
         treeheight_row = 50, treeheight_col = 50,
         annotation_col = meta,
         fontsize_row = 8, fontsize_col = 9,
         annotation_legend = TRUE,
         fontsize=8)

Functional Enrichment Analysis

In this section, we would like to take our results from the differential expression and identify genes in the result which are functionally related. To this end, this so-called enrichment analysis can help us. As mentioned in the lecture, there are several tools in R that allow us to perform this analysis, but we will make use of clusterProfiler (Wu et al. 2021). Its documentation can be found here. Its main strengths lie in a number of built-in visualisations we will find helpful in interpreting the data.

Review of Overrepresentation Analysis

From the docs, overrepresentation analysis (ORA) “is a widely used approach to determine whether known biological functions or processes are over-represented (i.e. enriched) in experimentally-derived gene list”, in our case, a list of differentially expressed genes. ORA should be primarily used when the effect size is large, i.e. when it is possible to use thresholding to attain a good sample of candidate genes.

The p-values for each gene set in the analysis can be calculated using a hypergeometric distribution:

\[ p = 1 - \sum_{0}^{k-1}\frac{{M \choose i}{N-M \choose n-i}}{N \choose i} \]

where \(M\) are the genes assigned to the gene set, \(N\) are all genes in our universe (i.e. all genes originally considered for the DE), \(n\) is the number of genes of interest, and \(k\) are the number of genes of interest which are assigned to the gene set.

Testing whether a set of genes are significantly represented in a set of genes then corresponds to a one-sided Fisher’s exact test.

Overrepresentation Analysis

Since this is the case for our cancer experiment, using ORA this is the recommended way. First, let’s split our genes into up-regulated and down-regulated.

# separate into 'up' and 'down' genes
isSig <- res$pvalue <= pvalThresh
upGenesIndex <- res$log2FoldChange >= log2FCThresh & isSig
downGenesIndex <- res$log2FoldChange <= -log2FCThresh & isSig
upGenesIndex[!isValid] <- FALSE
downGenesIndex[!isValid] <- FALSE
upGenes <- rownames(res)[upGenesIndex]
downGenes <- rownames(res)[downGenesIndex]
validGenes <- rownames(res)[isValid]

Now, we can call the enrichGO function within clusterProfiler on these gene subsets using one of the three gene ontologies as a reference. We will also use the two build-in plotting functions, dotplot and cnetplot.

ego <- clusterProfiler::enrichGO(gene = downGenes,
                                 universe = validGenes, 
                                 OrgDb = "org.Hs.eg.db", 
                                 keyType = "ENSEMBL", 
                                 ont = "BP", 
                                 pAdjustMethod = "BH", 
                                 pvalueCutoff = 0.01, 
                                 qvalueCutoff = 0.05,
                                 readable = FALSE)
# Generate log label mapping for cnet plot
log2RatioLabel <- res$log2FoldChange[downGenesIndex]
names(log2RatioLabel) <- downGenes
log2RatioLabel <- sort(log2RatioLabel, decreasing=TRUE)

# Plot dotplot
clusterProfiler::dotplot(ego) +
  theme(axis.text.y = element_text(vjust = -0.01, size = 8))

# Plot cnetplot
clusterProfiler::cnetplot(ego, 
                          size_category = 2,
                          color.params = list(foldChange = log2RatioLabel))

# Plot goplot
clusterProfiler::goplot(ego)

Exercise #4

  1. Try using the setReadable function to change the ego object such that the labels for the cnetplot are the gene names.
  2. Repeat the process above for the BP term, this time using the up-regulated genes.
  3. Try to run enrichKEGG from the clusterProfiler package.

References

Blighe, Kevin, Sharmila Rana, and Myles Lewis. 2022. “EnhancedVolcano: Publication-Ready Volcano Plots with Enhanced Colouring and Labeling.” https://github.com/kevinblighe/EnhancedVolcano.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2” 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Wu, Tianzhi, Erqiang Hu, Shuangbin Xu, Meijun Chen, Pingfan Guo, Zehan Dai, Tingze Feng, et al. 2021. “clusterProfiler 4.0: A Universal Enrichment Tool for Interpreting Omics Data” 2: 100141. https://doi.org/10.1016/j.xinn.2021.100141.